Evaluation of a Sentence Ranker for Text Summarization Based on Roget's Thesaurus
Authors
Alistair Kennedy and Stan Szpakowicz
Abstract
Evaluation is one of the hardest tasks in automatic text summarization. It is perhaps even harder to determine how much a particular component of a summarization system contributes to the success of the whole system. We examine how to evaluate the sentence-ranking component using a corpus which has been partially labelled with Summary Content Units. To demonstrate this technique, we apply it to the evaluation of a new sentence-ranking system which uses Roget's Thesaurus. This corpus provides a quick and nearly automatic method of evaluating the quality of sentence ranking.

1 Motivation and Related Work

One of the hardest tasks in Natural Language Processing is text summarization: given a document or a collection of related documents, generate a (much) shorter text which presents only the main points. A summary can be generic – with no restrictions other than the required compression – or query-driven, when the summary must answer a few questions or focus on the topic of a query. Language generation is a hard problem, so summarization usually relies on extracting relevant sentences and arranging them into a summary. While it is, on the face of it, easy to produce some summary, a good summary is a challenge, so evaluation is essential.

We discuss the use of a corpus labelled with Summary Content Units for evaluating the sentence-ranking component of a query-driven extractive text summarization system. We do it in two ways: we directly evaluate sentence ranking using Macro-Average Precision, and we evaluate summaries generated from that ranking, thus evaluating the ranking system indirectly.

The annual Text Analysis Conference (TAC; formerly the Document Understanding Conference, or DUC), organized by the National Institute of Standards and Technology (NIST), includes tasks in text summarization. In 2005–2007, the challenge was to generate 250-word query-driven summaries of collections of 20–50 news articles. In 2008–2009 (after a 2007 pilot), the focus shifted to creating update summaries, where the document set is split into a few subsets from which 100-word summaries are generated.

– As opposed to the international media hype that surrounded last week's flight, with hundreds of journalists on site to capture the historic moment, Airbus chose to conduct Wednesday's test more discreetly.
– After its glitzy debut, the new Airbus super-jumbo jet A380 now must prove soon it can fly, and eventually turn a profit.
– "The takeoff went perfectly," Alain Garcia, an Airbus engineering executive, told the LCI television station in Paris.

Fig. 1. Positive, negative and neutral sentence examples for the query "Airbus A380 – Describe developments in the production and launch of the Airbus A380".

1.1 Summary Evaluation

One kind of manual evaluation at DUC/TAC is a full evaluation of the readability and responsiveness of the summaries. Responsiveness tells us how good a summary is; the score is meant to reflect both grammaticality and content. Another method of manual evaluation is pyramid evaluation [1]. It begins with creating several reference summaries and determining what information in them is most relevant. Each relevant element is called a Summary Content Unit (SCU), carried in text by a fragment ranging from a few words to a sentence. All SCUs are marked in the reference summaries and make up a so-called pyramid, with a few frequent SCUs at the top and many rarer ones at the bottom. In the pyramid evaluation proper, annotators identify SCUs in peer summaries. The SCU count tells us how much relevant information a peer summary contains, and what redundancy there is if an SCU appears more than once. The modified pyramid score measures the recall of SCUs in a peer summary [2].
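As an informal illustration (ours, not the authors' code), the modified pyramid score can be sketched as follows. The normalization is an assumption based on the description in [2]: a peer summary's total SCU weight is divided by the weight of an ideally informative summary containing the (rounded) average number of SCUs found in the reference summaries. The helper name modified_pyramid_score is hypothetical.

    def modified_pyramid_score(peer_scu_ids, pyramid_weights, avg_scu_count):
        """Sketch of the modified pyramid score, following [2] informally.

        peer_scu_ids:    SCU identifiers found by annotators in the peer summary
                         (duplicates indicate redundancy and are counted once).
        pyramid_weights: dict mapping SCU id -> weight, i.e. how many reference
                         summaries express that SCU.
        avg_scu_count:   rounded average number of SCUs per reference summary.
        """
        # Each distinct SCU contributes its pyramid weight; repeats are redundancy.
        observed = sum(pyramid_weights[scu] for scu in set(peer_scu_ids))
        # An ideally informative summary would express the avg_scu_count
        # heaviest SCUs in the pyramid.
        top_weights = sorted(pyramid_weights.values(), reverse=True)
        ideal = sum(top_weights[:avg_scu_count])
        return observed / ideal if ideal else 0.0

    # Toy example mirroring Figure 1: SCU 11 has weight 4, SCU 12 has weight 2.
    weights = {11: 4, 12: 2, 13: 3, 14: 1}
    print(modified_pyramid_score([11, 12, 12], weights, avg_scu_count=3))

In this toy run the peer summary expresses SCUs 11 and 12 (weight 4 + 2 = 6) against an ideal of the three heaviest SCUs (4 + 3 + 2 = 9), giving a score of about 0.67.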
1.2 The SCU-Labelled Corpus

Pyramid evaluation supplies fully annotated peer summaries. Those are usually extractive summaries, so one can map their sentences back to the original corpus [3]. Many sentences in the corpus can thus be labelled with the list of SCUs they contain, along with each SCU's identifier and score. [3] reported that 83% of the sentences from the 2005 peer summaries and 96% of the sentences from the 2006 peer summaries could be mapped back to the original corpus. A dataset has been generated for the DUC/TAC main task data in the years 2005–2009, and for the update task in 2007.

We consider three kinds of sentences, illustrated in Figure 1. First, a positive example: its tag shows its use in the summary with ID 0, and lists two SCUs, one with ID 11 and weight 4, the other with ID 12 and weight 2. The second sentence – a negative example – has an SCU count of 0, but is annotated because of its use in summaries 14, 44 and 57. The third, unlabelled sentence was not used in any summary, so it carries no annotation. The data set contains 19,248 labelled sentences out of a total of 91,658 in 277 document sets. The labelled data are 39.7% positive.

Parts of the SCU-labelled corpus have been used in other research. In [4], the 2005 data are the means for evaluating two sentence-ranking graph-matching algorithms for summarization. The rankers match the parsed sentences of the query with parsed sentences in the document set. For evaluation, summaries were constructed from the highest-ranked sentences, and the sum of the sentence SCU scores served as the score of the summary. One problem with this method is that both labelled and unlabelled data were used, which makes the summary SCU scores a lower bound on the expected scores. Moreover, the method does not evaluate a sentence ranker directly on its own, but rather in tandem with a simple summarization system. In [5], an SVM is trained on positive and negative sentences from the 2006 DUC data and tested on the 2005 data. The features include sentence position, overlap with the query and others based on text cohesion. In [6], the SCU-labelled corpus is used to find a baseline algorithm for update summarization called Sub-Optimal Position Policy (SPP), an extension of Optimal Position Policy (OPP) [7]. In [8], the corpus from 2005–2007 is used to determine that automatically generated summaries tend to be query-biased (selecting sentences to maximize overlap with a query) rather than query-focused (answering the query).
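To make the direct evaluation concrete, here is a minimal sketch (ours, not taken from the paper) of Macro-Average Precision over such a partially labelled corpus. The assumptions: within each document set, only labelled sentences are scored, a sentence with a non-zero SCU count counts as positive, average precision is computed per document set over the ranker's ordering, and the per-set scores are macro-averaged. All names are illustrative.

    def average_precision(ranked_labels):
        """Average precision for one document set. ranked_labels is a list of
        booleans (True = sentence carries at least one SCU), ordered by the
        ranker's score, best first; only labelled sentences are included."""
        hits, precision_sum = 0, 0.0
        for rank, positive in enumerate(ranked_labels, start=1):
            if positive:
                hits += 1
                precision_sum += hits / rank   # precision at this recall point
        return precision_sum / hits if hits else 0.0

    def macro_average_precision(document_sets, ranker):
        """document_sets: iterable of lists of (sentence, scu_count) pairs.
        ranker: callable scoring a sentence; higher = more summary-worthy."""
        ap_scores = []
        for sentences in document_sets:
            ranked = sorted(sentences, key=lambda s: ranker(s[0]), reverse=True)
            ap_scores.append(average_precision([scu > 0 for _, scu in ranked]))
        # Macro-average: every document set contributes equally to the score.
        return sum(ap_scores) / len(ap_scores)

    # Toy usage with a trivial ranker that prefers longer sentences.
    doc_sets = [[("short one", 0), ("a much longer candidate sentence", 2)]]
    print(macro_average_precision(doc_sets, ranker=len))

Macro-averaging (rather than pooling all sentences) keeps small document sets from being swamped by large ones, which matters when set sizes vary from 20 to 50 articles.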
Similar resources
Entropy-based Sentence Selection with Roget's Thesaurus
This year at the University of Ottawa we submitted two systems to the Guided Summarization challenge. In our submissions we tested how well an entropy-based measure of sentence selection worked against a baseline system. The entropy-based sentence selector showed improvement over the baseline: it increased the number of unique Summary Content Units selected, and reduced the number of redundant S...
Full text
Not as Easy as It Seems: Automating the Construction of Lexical Chains Using Roget's Thesaurus
Morris and Hirst [10] present a method of linking significant words that are about the same topic. The resulting lexical chains are a means of identifying cohesive regions in a text, with applications in many natural language processing tasks, including text summarization. The first lexical chains were constructed manually using Roget’s International Thesaurus. Morris and Hirst wrote that autom...
Full text
A Persian Text Summarization System Based on Linguistic Features and Regression
Considering the vast amount of existing written information and the shortage of time, optimal summarization of books, articles, news reports, etc. on the Web is a major concern of researchers. In this paper, we propose a new approach for Persian single-document summarization based on several linguistic features of the text. In our approach, after extracting the linguistic features for each sentence,...
Full text
Text Segmentation Using Roget-Based Weighted Lexical Chains
In this article we present a new method for text segmentation. The method relies on the number of lexical chains (LCs) which end in a sentence, which begin in the following sentence and which traverse the two successive sentences. The lexical chains are based on Roget’s thesaurus (the 1987 and the 1911 version). We evaluate the method on ten texts from the DUC 2002 conference and on twenty text...
Full text
Fast Semantic Relatedness: WordNet::Similarity vs Roget's Thesaurus
A Measure of Semantic Relatedness (MSR) automatically determines how close two words are in meaning. MSRs are used in such Natural Language Processing (NLP) problems as word-sense disambiguation or text summarization. Solving such problems may require millions of relatedness scores, yet MSR run-time, clearly a major concern, has rarely been considered in NLP research. To evaluate an MSR, one o...
Full text